Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34819][SQL] MapType supports orderable semantics #31967

Closed
wants to merge 5 commits into from

Conversation

WangGuangxin
Copy link
Contributor

What changes were proposed in this pull request?

Currently MapType doesn't support orderable semantics, while it's supported in Hive/Presto. This makes it hard to migrate from Hive to SparkSQL if user have groupby/orderby map type in their sql.

Why are the changes needed?

Generally, we compare two maps by the following steps:

  1. If the size of two maps are not equal, compare them by size.
  2. Otherwise, sort each map entry by map key, then compare two map entries one by one, first compare by key, then value.

We have to specially handle this in grouping/join/window because Spark SQL turns grouping/join/window partition keys into binary UnsafeRow and compare the binary data directly instead of using MapType's ordering. In this case, we have to insert a SortMapKey expression to sort map entry by key. This is very similiar to NormalizeFloatingNumbers

Does this PR introduce any user-facing change?

No

How was this patch tested?

Add more UTs

@github-actions github-actions bot added the SQL label Mar 26, 2021
@WangGuangxin
Copy link
Contributor Author

@hvanhovell @cloud-fan @maropu Could you please help review this?

Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a re-proposal for https://issues.apache.org/jira/browse/SPARK-18134 ? I thought the decision made on that JIRA is we won't support this. But we also encountered this when migrating from Hive to Spark. We worked around this by adding a logical plan rule to covert map to sorted array if needed.

@hvanhovell
Copy link
Contributor

@c21 we should support this. I just ran out of time when I was working on it.

@hvanhovell
Copy link
Contributor

Ok to test

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41140/

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41140/

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Test build #136559 has finished for PR 31967 at commit f60a5c4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Test build #136556 has finished for PR 31967 at commit a08356a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41143/

@SparkQA
Copy link

SparkQA commented Mar 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41143/

@SparkQA
Copy link

SparkQA commented Mar 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41196/

@SparkQA
Copy link

SparkQA commented Mar 28, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41196/

@SparkQA
Copy link

SparkQA commented Mar 28, 2021

Test build #136614 has finished for PR 31967 at commit 58bb3cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPAKR-34819][SQL]MapType supports orderable semantics [SPARK-34819][SQL]MapType supports orderable semantics Mar 29, 2021
@maropu
Copy link
Member

maropu commented Mar 29, 2021

Links to the previous PRs: #15970 and #19330

@maropu maropu changed the title [SPARK-34819][SQL]MapType supports orderable semantics [SPARK-34819][SQL] MapType supports orderable semantics Mar 30, 2021
@@ -687,6 +688,118 @@ class CodegenContext extends Logging {
}
"""
s"${addNewFunction(compareFunc, funcCode)}($c1, $c2)"
case _ @ MapType(keyType, valueType, valueContainsNull) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -141,6 +139,27 @@ object NormalizeFloatingNumbers extends Rule[LogicalPlan] {
val function = normalize(lv)
KnownFloatingPointNormalized(ArrayTransform(expr, LambdaFunction(function, Seq(lv))))

case _ if expr.dataType.isInstanceOf[MapType] =>
val MapType(kt, vt, containsNull) = expr.dataType
var normalized = if (needNormalize(kt)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you avoid to use var here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add tests for this new code path in NormalizeFloatingPointNumbersSuite?

*/
object NormalizeMapType extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case w: Window if w.partitionSpec.exists(p => needNormalize(p)) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You didn't support BinaryComparison cases?


override def nullSafeEval(input: Any): Any = {
val childMap = input.asInstanceOf[MapData]
val keys = childMap.keyArray()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to sort data recursively just for nested case like map<map<int,int>,string> and map<struct<a: map<int,int>>,string>)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that I missed this case. I'll fix it

}
}

case class SortMapKey(child: Expression) extends UnaryExpression with ExpectsInputTypes {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SortMapKey -> SortMapKeys?

mapBuilder.build()
}

override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this PR simpler, how about leaving the codegen support into follow-up PRs just like the original PR? https://github.com/apache/spark/pull/15970/files#diff-da163d97a5f0fc534aad719c4a39eca97116f25bfc05b7d8941b342a3ed96036R423-R429

Batch("ReplaceUpdateFieldsExpression", Once, ReplaceUpdateFieldsExpression)


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unnecessary change.

}
}


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove unnecessary blank lines.

* to insert an expression to sort map entries by key.
*
* Note that, this rule must be executed at the end of optimizer, because the optimizer may create
* new joins(the subquery rewrite) and new join conditions(the join reorder).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you leave some comments about why this rule does not handle the Aggregate cases? https://github.com/apache/spark/pull/31967/files#diff-21f071d73070b8257ad76e6e16ec5ed38a13d1278fe94bd42546c258a69f4410R344


test("SPARK-34819: MapType has nesting complex type supports orderable semantics") {
Seq(CodegenObjectFactoryMode.CODEGEN_ONLY.toString,
CodegenObjectFactoryMode.NO_CODEGEN.toString).foreach {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move the two tests into SQLQueryTestSuite? You can use the CONFIG_DIM directive there:
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/postgreSQL/join.sql#L18-L20

@maropu
Copy link
Member

maropu commented Apr 26, 2021

Any update?

@maropu
Copy link
Member

maropu commented May 13, 2021

@WangGuangxin If you cannot keep working on it, is it okay that I take this over?

@WangGuangxin
Copy link
Contributor Author

@WangGuangxin If you cannot keep working on it, is it okay that I take this over?

Sure, I'm stuck with something else, you can take this over if you have time. Thanks

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Aug 22, 2021
@github-actions github-actions bot closed this Aug 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants